Abstract. We study the problem of indexing text with wildcard positions, motivated by the
challenge of aligning sequencing data to large genomes that contain millions of single nucleotide
polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards
can lead to more informed and biologically relevant alignments. We improve the space complexity
of previous approaches by giving a succinct index requiring
$(2 + o(1))n \log \sigma + O(n) + O(d \log n) + O(k \log k)$ bits for a text of length $n$ over an
alphabet of size $\sigma$ containing $d$ groups of $k$ wildcards. A key to the space reduction is a
result we give showing how any compressed suffix array can be supplemented with auxiliary data
structures occupying $O(n) + O(d \log \frac{n}{d})$ bits to also support efficient dictionary
matching queries. The query algorithm for our wildcard index is faster than previous approaches
using reasonable working space. More importantly, our new algorithm greatly reduces the query
working space to $O(dm + m \log n)$ bits. We note that, compared to previous results, this reduces
the working space by two orders of magnitude when aligning short read data to the Human genome.

1 Introduction 

The study of strings, their properties, and associated algorithms has played a key role in
advancing our understanding of problems in areas such as compression, text mining, information
retrieval, and pattern matching, amongst numerous others. A most basic and widely studied
question in stringology asks: given a string $T$ (the text), does it contain a string $P$ (the
pattern) as a substring? It is well known that this problem can be solved in time proportional to
the lengths of both strings [10]. However, it is often the case that we wish to repeat this question
for many different pattern strings and a fixed text $T$ of length $n$ over an alphabet of size
$\sigma$. The idea is to create a full-text index for $T$ so that repeated queries can be answered
in time proportional to the length of $P$ alone. It was first shown by Weiner [18] in 1973 that the
suffix tree data structure could be built in linear time for exactly this purpose. The ensuing years
have seen the versatility of the suffix tree as it has been demonstrated to solve numerous other
related problems.

While suffix trees use $O(n)$ words of space in theory, this does not translate to a
space-efficient data structure in practice. For this reason, Manber and Myers [12] proposed the
suffix array data structure (see Figure 1). Though a great practical improvement over suffix trees,
the $\Omega(n \log n)$ bit space requirement is often prohibitive for larger texts. Building in part
on the pioneering work of Jacobson [9] into succinct data structures, two seminal papers helped
usher in the study of so-called succinct full-text indexes. Grossi and Vitter [7] proposed a
compressed suffix array that occupies $O(n \log \sigma)$ bits, the same space required to represent
the original string $T$. Soon after, Ferragina and Manzini [5] proposed the FM-index, a type of
compressed suffix array that can be inferred from the Burrows-Wheeler transform of the text and some
auxiliary structures, leading to a space occupancy proportional to $nH_k(T)$ bits, where $H_k(T)$
denotes the $k$th order empirical entropy of $T$. These and subsequent results have made it possible
to efficiently answer the substring question on texts as large as, or larger than, the Human genome.

We are interested in designing a succinct index to answer a generalized version of the
substring question where the text $T$ contains $k$ wildcard positions that can match any character
of a pattern. Our motivation arises in the context of aligning short read data produced by second
generation sequencing technology. Typically short reads are aligned against a so-called reference
genome; however, the positions known to differ between individuals due to single nucleotide
polymorphisms (SNPs) number in the millions [6]. Modeling SNPs as wildcards would yield more
informed and, by extension, more accurate alignment of short reads.

Cole, Gottlieb & Lewenstein [3] were among the first to study the problem of indexing
text sequences containing wildcards and proposed an index using $O(n \log^k n)$ words of space
capable of answering queries in $O(m + \log^k n \log\log n + occ)$ time. This result was later
improved by Lam et al. [11], resulting in space usage of only $O(n)$ words and a query time no
longer exponential in $k$. A key idea in their work was to build a type of dictionary of the text
segments of $T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}$, where each text segment
$T_i$ contains no wildcards and $\phi^{k_i}$ denotes the $i$th wildcard group of size $k_i \ge 1$,
for $1 \le i \le d \le k$. The query time includes the term
$\gamma = \sum_{i,j} \mathrm{prefix}(P[i..|P|], T_j)$, where $\mathrm{prefix}(P[i..|P|], T_j) = 1$
if $T_j$ is a prefix of $P[i..|P|]$ and 0 otherwise. The authors also give a more detailed bound on
$\gamma$ based on prefix complexity.

Despite this improvement, $O(n)$ words of space is prohibitive for texts as large as the
Human genome. Support for dictionary matching of text segments was also crucial in the approach
of Tam et al. [17], who proposed the first, and to our knowledge only, succinct index. They
designed a dictionary structure using $(2 + o(1))n \log \sigma$ bits, based on a compressed suffix
array, which therefore occupies most of the space required by their overall index. Very recently,
Belazzougui [1] proposed a succincter dictionary based on the Aho-Corasick automaton having
optimal query time. The compressed space occupancy was further improved by a slight
modification given by Hon et al. [8]. While these results are impressive, the wildcard matching
problem benefits from an index that can report the text segments contained in $P$ (the dictionary
problem), as well as the text segments which are prefixed by $P$ and those which fully contain $P$.
To draw a distinction, we will refer to this latter type as a full-text dictionary. In our first
main contribution we show how a full-text dictionary can be built on top of any compressed suffix
array using an additional $O(n) + O(d \log \frac{n}{d})$ bits of space, and in turn how it can be
used to provide a succincter index for texts containing wildcards. We note that our dictionary does
not require any modification of the original string $T$.

In our view, the main challenge that must be overcome for successful wildcard matching is a
reduction of the query working space. The fastest solution of Tam et al. [17] matches our query
time, if modified to use the same orthogonal range query structure we use, but requires a query
working space of $O(n \log d + m \log n)$ bits. Acknowledging that the first term is impractical for
large texts, they give a slower solution that reduces the working space to be proportional to
the index itself. This makes the solution feasible, but constraining, considering that $p$
parallel queries necessarily increase the working space by a factor of $p$. In our second main
contribution we give an algorithm that reduces the query working space significantly to
$O(dm + m \log n)$ bits. For our motivating problem, alignment of short reads (32-64 bases) to
the Human genome (3 billion bases with 1-2 million SNPs), this reduces the working space by
two orders of magnitude, from gigabytes to tens of megabytes. Our result for indexing text with
wildcards is summarized and compared with existing results in Table 1.
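
As a rough check of the working-space claim, the two bounds can be evaluated numerically. The sketch below drops all constant factors and lower-order terms, rounds logarithms up to whole bits, and assumes the upper ends of the ranges quoted above ($d$ = 2 million groups, $m$ = 64); the function name `working_space_bits` is ours and purely illustrative.

```python
import math

def working_space_bits(n, d, m):
    """Return (previous, proposed) working-space estimates in bits, ignoring constant factors."""
    log_n = math.ceil(math.log2(n))
    log_d = math.ceil(math.log2(d))
    previous = n * log_d + m * log_n   # O(n log d + m log n)
    proposed = d * m + m * log_n       # O(dm + m log n)
    return previous, proposed

# Assumed illustrative values, taken from the ranges quoted in the text.
n = 3_000_000_000   # bases in the Human genome
d = 2_000_000       # wildcard groups (upper end of the quoted SNP count)
m = 64              # read length (upper end of the quoted range)

previous, proposed = working_space_bits(n, d, m)
print(f"previous: {previous / 8 / 1e9:.1f} GB")
print(f"proposed: {proposed / 8 / 1e6:.1f} MB")
```

Under these assumptions the first bound comes to roughly 7.9 GB and the second to about 16 MB, in line with the "gigabytes to tens of megabytes" statement.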

2 Preliminaries 

Let $T[1, n]$ be a string over a finite alphabet $\Sigma$ of size $\sigma$. We denote its $j$th
character by $T[j]$ and a substring from the $i$th to the $j$th position by $T[i..j]$. We assume
that an end-of-text sentinel character $\$ \notin \Sigma$ has been appended to $T$ ($T[n] = \$$) and
that $\$$ is lexicographically smaller than any character in $\Sigma$. For any substring $X$ we use
$|X|$ to denote its length and $\overline{X}$ to denote its reverse sequence. The suffix array SA of
$T$ is a permutation of the integers $[1, n]$ giving the increasing lexicographical order of the
suffixes of $T$. Conceptually, SA can be thought of as a matrix of all suffixes of $T$ that have
been sorted lexicographically, where $SA[i] = j$ means that the $i$th lexicographically smallest
suffix of $T$ begins at position $j$.
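
To make the definition concrete, the following minimal sketch builds a suffix array by sorting all suffixes directly. It is quadratic and uncompressed, so it only illustrates the definition and is not the compressed structure assumed later; the example string `banana$` and the function name are our own choices.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting all suffixes (illustration only)."""
    n = len(text)
    # SA[i] = j means the i-th smallest suffix starts at 1-based position j.
    return sorted(range(1, n + 1), key=lambda j: text[j - 1:])

T = "banana$"   # '$' sorts before the lowercase letters in ASCII
SA = suffix_array(T)
for i, j in enumerate(SA, start=1):
    print(i, j, T[j - 1:])
```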

Table 1. A comparison of text indexes supporting wildcard characters. $k$, $d$, $\hat{d}$ are the
numbers of wildcards, wildcard groups, and distinct wildcard group lengths, respectively;
$occ_1$, $occ_2$, $occ$ are the numbers of Type 1, Type 2, and overall occurrences, respectively;
$\gamma = \sum_{i,j} \mathrm{prefix}(P[i..|P|], T_j)$; $\dagger$ = our result.

A string $X$ has a suffix array (SA) range $[a, b]$ with respect to SA if $a - 1$ ($n - b$) suffixes
of $T$ are lexicographically smaller (larger) than $X$. If $a > b$ the range is said to be empty and
$X$ does not exist as a substring of $T$; otherwise, $X$ occurs as a prefix of the $b - a + 1$
suffixes of $T$ denoted by its range. The SA range for $X$ can be found in a compressed suffix array
by backwards search using the LF-mapping, which relates SA to $T^{BWT}$, the Burrows-Wheeler
transform of $T$. $T^{BWT}$ is also a string of length $n$, where $T^{BWT}[i] = T[SA[i] - 1]$ if
$SA[i] \neq 1$, and $T^{BWT}[i] = \$$ otherwise. See Figure 1 for an example. For details of
backwards search, the LF-mapping, existing implementations, and related topics we refer the reader
to the excellent review by Navarro and Mäkinen [13]. In this work, we assume the availability of a
compressed suffix array meeting the following space and time requirements, of which there are many
(c.f. [14]).
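
The following sketch illustrates the Burrows-Wheeler transform and backwards search on an uncompressed string. It is an assumption-laden illustration: rank queries are answered by naive counting rather than by a succinct structure, and the helper names (`bwt`, `backward_search`) are ours rather than part of any cited implementation.

```python
def suffix_array(text):
    """1-based suffix array by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])

def bwt(text, sa):
    """T^BWT[i] = T[SA[i] - 1] if SA[i] != 1, and '$' otherwise."""
    return "".join(text[j - 2] if j != 1 else "$" for j in sa)

def backward_search(text_bwt, pattern):
    """Return the 1-based SA range (a, b) of pattern, or None if the range is empty."""
    counts = {}
    for ch in text_bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # C[ch] = number of characters in the text that are smaller than ch
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # occurrences of ch in text_bwt[1..i]; naive stand-in for a succinct rank
        return text_bwt[:i].count(ch)

    a, b = 1, len(text_bwt)
    for ch in reversed(pattern):        # process the pattern right to left
        if ch not in C:
            return None
        a = C[ch] + rank(ch, a - 1) + 1  # LF-mapping applied to both ends of the range
        b = C[ch] + rank(ch, b)
        if a > b:
            return None
    return a, b

T = "banana$"
SA = suffix_array(T)
T_bwt = bwt(T, SA)
print(T_bwt, backward_search(T_bwt, "ana"))
```

Each pattern character costs two rank queries, which is where the $\log \sigma$ factor per character arises when rank is supported over a general alphabet.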

Lemma 1. A compressed suffix array SA for $T$ can be represented in $(1 + o(1))n \log \sigma$ bits
of space, such that the suffix array range of every suffix of a string $X$ can be computed in
$O(|X| \log \sigma)$ time, and each match of $X$ in $T$ can be reported in an additional
$O(\log n)$ time.

In our dictionary construction, we also make use of the following well known data structures. 

Lemma 2 (Raman et al. [16]). A bit vector $B$ of length $n$ containing $d$ 1 bits can be
represented in $d \log \frac{n}{d} + O(d + n \frac{\log\log n}{\log n})$ bits to support the
operations $rank_1(B, i)$, giving the number of 1 bits appearing in $B[1..i]$, and
$select_1(B, i)$, giving the position of the $i$th 1 in $B$, in $O(1)$ time.
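
The semantics of the two operations can be sketched as follows. This plain list-based class is only meant to pin down what $rank_1$ and $select_1$ return; it runs in linear time and uses none of the succinct machinery of Lemma 2. The class name and example bits are hypothetical.

```python
class BitVector:
    """Naive bit vector with 1-based rank/select (semantics only, not succinct)."""

    def __init__(self, bits):
        self.bits = list(bits)

    def rank1(self, i):
        """Number of 1 bits in B[1..i]."""
        return sum(self.bits[:i])

    def select1(self, i):
        """Position of the i-th 1 bit, or None if there are fewer than i ones."""
        seen = 0
        for pos, bit in enumerate(self.bits, start=1):
            seen += bit
            if bit and seen == i:
                return pos
        return None

B = BitVector([0, 1, 0, 0, 1, 1, 0, 1])
print(B.rank1(5), B.select1(3))
# Composing the two gives the last 1 at or before a position:
print(B.select1(B.rank1(4)))
```

The composition in the last line, $select_1(B, rank_1(B, a))$, is the predecessor query used later when locating the marked position nearest to a suffix array position.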

Lemma 3 (Grossi & Vitter [7]). An array $L$ of $d$ integers where $\sum_{i=1}^{d} L[i] = n$ can be
represented in $d(\lceil \log(n/d) \rceil + 2 + o(1))$ bits to support $O(1)$ time access to any
element.

Lemma 4 (Munro & Raman [13]). A sequence BP of $d$ balanced parentheses can be represented in
$(2 + o(1))d$ bits of space to support the following operations in $O(1)$ time: $rank_{(}(BP, i)$,
$select_{(}(BP, i)$, and similarly for right parentheses, as well as:

- findclose(BP, $l$): index of the matching right parenthesis for the left parenthesis at position $l$

- enclose(BP, $i$): indexes $(l, r)$ of the closest matching pair to enclose $(i, \mathrm{findclose}(BP, i))$ if
such a pair exists, and an undefined interval otherwise
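
The two navigation operations can be illustrated with linear scans over a parenthesis string. This is a sketch of their meaning only: the scans below are not constant time, and returning `None` for an undefined interval is our convention.

```python
def findclose(bp, l):
    """1-based index of the right parenthesis matching the left parenthesis at position l."""
    depth = 0
    for pos in range(l, len(bp) + 1):
        depth += 1 if bp[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    return None

def enclose(bp, l):
    """Indexes (l', r') of the closest pair enclosing the pair that opens at l, or None."""
    depth = 0
    for pos in range(l - 1, 0, -1):      # scan left from just before l
        if bp[pos - 1] == ")":
            depth += 1                   # entering a sibling pair that must be skipped
        elif depth == 0:
            return pos, findclose(bp, pos)   # first unmatched '(' to the left
        else:
            depth -= 1                   # this '(' closes a skipped sibling
    return None

bp = "((())(()))()"
print(findclose(bp, 1), enclose(bp, 7), enclose(bp, 6), enclose(bp, 1))
```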

The matching statistics for a string $X$ with respect to SA is an array $ms$ of tuples such
that $ms[i] = (q, [a, b])$ states that the longest prefix of $X[i..|X|]$ that matches anywhere in
$T$ has length $q$ and suffix array range $[a, b]$. Very recently, Ohlebusch et al. [15] showed that
matching statistics can be efficiently computed with backward search if SA is enhanced with
auxiliary data structures using $O(n)$ bits to represent so-called longest common prefix intervals
(c.f. [15]). We leverage this result in the design of our succinct full-text dictionary and its
search algorithm.
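
As a brute-force illustration of the definition, the sketch below computes only the length component $q$ of each entry by repeated substring tests; the suffix array ranges are omitted, and nothing here reflects the backward-search method of Ohlebusch et al. The function name and the example strings are our own.

```python
def matching_statistics_lengths(text, pattern):
    """For each start i of pattern, the length of the longest prefix of pattern[i..] found in text."""
    ms = []
    for i in range(len(pattern)):
        q = 0
        # extend the match one character at a time while it still occurs somewhere in text
        while i + q < len(pattern) and pattern[i:i + q + 1] in text:
            q += 1
        ms.append(q)
    return ms

ms = matching_statistics_lengths("banana$", "anab")
print(ms)
```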

Lemma 5 (Ohlebusch et al. [15]). The matching statistics of a pattern $X$ with respect to
text $T$ over an alphabet of size $\sigma$ can be computed in $O(|X| \log \sigma)$ time given a
compressed enhanced suffix array of $T$.

Finally, our wildcard matching algorithm makes use of an orthogonal range query data structure.

Lemma 6 (Bose et al. [2]). A set $N$ of points from universe $M = [1..k] \times [1..k]$, where
$k = |N|$, can be represented in $(1 + o(1))k \log k$ bits to support orthogonal range reporting in
$O(occ \frac{\log k}{\log \log k})$ time, where $occ$ is the size of the output.

3 A succinct full-text dictionary 

In the dictionary problem we are required to index a set of $d$ text segments$^1$
$\mathcal{D} = \{T_1, T_2, \ldots, T_d\}$ so that we can efficiently match in any input string $P$
all occurrences of text segments belonging to $\mathcal{D}$. We present a succinct full-text
dictionary index that is also capable of efficiently identifying all text segments that contain $P$
as a prefix, or more generally as a substring. We demonstrate the use of this additional
functionality in our solution for wildcard matching.

3.1 A compressed suffix array representation of text segments 

Let $T = \phi T_1 \phi T_2 \phi T_3 \phi \ldots \phi T_d \$$ be the concatenation of all $d$ text
segments, each prefixed by the character $\phi$, followed by the traditional end-of-text sentinel
$\$$, having total length $n$. Note that $n$ is necessarily larger than the total number of
characters in the dictionary. We define $\phi$ to be lexicographically smaller than any
$c \in \Sigma$ and $\$$ to be lexicographically smaller than $\phi$. We first build SA, the
compressed suffix array for $T$. Consider any text segment $T_j \in \mathcal{D}$. There will be a
contiguous range $[c, d]$ of suffixes in SA that are prefixed by the string $T_j$. Lemma 7
summarizes how we can use the SA range of $T_j$ and its length to determine if it is a prefix of a
given text $P$ (and vice versa).

Lemma 7. Let SA be the compressed suffix array for $T$ and let $[a, b]$ and $[c, d]$ be the
non-empty suffix array ranges in SA for a string $P$ and a text segment $T_j$, respectively. Then
$T_j$ is a prefix of $P$ if and only if $c \le a \le b \le d$ and $|P| \ge |T_j|$. Similarly, $P$
is a prefix of $T_j$ if and only if $a \le c \le d \le b$.
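
The two tests of Lemma 7 reduce to interval containment plus a length comparison, which the sketch below spells out. The function names are hypothetical, and the ranges used in the example are the ones quoted for Figure 1 in Section 3.3 (a: [8, 16], ac: [14, 16], aca: [15, 15]).

```python
def encloses(outer, inner):
    """True if SA range outer = (lo, hi) contains SA range inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def segment_is_prefix_of_pattern(seg_range, seg_len, pat_range, pat_len):
    """Lemma 7, first part: T_j is a prefix of P iff c <= a <= b <= d and |P| >= |T_j|."""
    return encloses(seg_range, pat_range) and pat_len >= seg_len

def pattern_is_prefix_of_segment(seg_range, pat_range):
    """Lemma 7, second part: P is a prefix of T_j iff a <= c <= d <= b."""
    return encloses(pat_range, seg_range)

# Segment "ac" (range [14, 16], length 2) against pattern "aca" (range [15, 15], length 3).
print(segment_is_prefix_of_pattern((14, 16), 2, (15, 15), 3))
# Pattern "a" (range [8, 16]) against segment "aca" (range [15, 15]).
print(pattern_is_prefix_of_segment((15, 15), (8, 16)))
```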

3.2 Storing text segment lengths 

For Lemma 7 to apply, we must know both the SA range of a given text segment and also its
length. By Lemma 3 we can store the lengths of all $d$ text segments in a compressed integer
array $L$ using $d(\lceil \log(n/d) \rceil + 2 + o(1))$ bits, ensuring constant time access. We
store the lengths in $L$ relative to the lexicographical order of text segments.

3.3 The text segment interval tree 

The SA range of one text segment $T_i$ will enclose the SA range of another $T_j$ if $T_i$ is a
prefix of $T_j$. For instance, in the example of Figure 1 the text segment aca has SA range [15, 15]
and is enclosed by the SA range of the text segment ac ([14, 16]) and by that of the text segment a
([8, 16]). In general, it is also possible that the SA ranges of many text segments begin at the
same position, provided that they are different occurrences of the same string (e.g., aa). This is
by design since each text segment is followed by a character not found in $\Sigma$ (either $\phi$
or $\$$). However, our construction requires us to distinguish between different occurrences of the
same text segment string and we therefore introduce the concept of text segment intervals. When
$t > 1$ text segments in the dictionary share a common SA range we say that the text segment
interval of occurrence $a$ encloses the text segment interval of occurrence $b$,
$1 \le a \neq b \le t$, if the suffix of $T$ beginning with occurrence $a$ is lexicographically
smaller than the suffix beginning with occurrence $b$. In this way we are able to define a total
order on all $d$ text segment intervals based on their relative lexicographical order in SA. We
assign lex ids, a unique identifier for each text segment, based on this lexicographical order.
Consider again the example in Figure 1. The text segment aa occurs as a prefix of $T[2..n]$ and
$T[11..n]$. Since the suffix $T[2..n]$ is lexicographically smaller than $T[11..n]$, we say that the
occurrence prefixing $T[2..n]$ encloses the other. Consequently, the text segment prefixing
$T[2..n]$ ($T[11..n]$) is assigned lex id 2 (3). We will refer to text segments or text segment
intervals interchangeably.

Fig. 1. A succinct full-text dictionary for the set of text segments {aa, aca, a, aa, cacc, ac}.
Shown are the sorted suffixes of the string $T = \phi aa \phi aca \phi a \phi aa \phi cacc \phi ac \$$
representing the text segments. Text segment intervals are demarcated on the left and labeled by
their lexicographical order (lex id) and the text segment they represent.

$^1$ To remain consistent with the section that follows we refer to dictionary entries (patterns) as text segments.
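
The ranges and lex ids quoted for Figure 1 can be reproduced by brute force. In the sketch below the ASCII character `%` stands in for $\phi$ (an assumption made so that `$` < `%` < letters under Python's default string order), and lex ids are assigned by listing, in suffix array order, the suffixes that start a text segment, which is our reading of the ordering described above.

```python
PHI = "%"   # stand-in for the separator; '$' < '%' < 'a' in ASCII, as the construction requires

def suffix_array(text):
    """1-based suffix array by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])

def sa_range(text, sa, x):
    """1-based SA range (a, b) of the suffixes prefixed by x, or None if x does not occur."""
    hits = [i for i, j in enumerate(sa, start=1) if text.startswith(x, j - 1)]
    return (hits[0], hits[-1]) if hits else None

segments = ["aa", "aca", "a", "aa", "cacc", "ac"]
T = "".join(PHI + s for s in segments) + "$"
SA = suffix_array(T)

# A suffix starts a text segment exactly when the preceding character is PHI.
starts = [j for j in SA if j > 1 and T[j - 2] == PHI]
lex_ids = {j: k for k, j in enumerate(starts, start=1)}

for s in ["a", "aa", "ac", "aca", "cacc"]:
    print(s, sa_range(T, SA, s))
print(lex_ids)
```

The occurrences of aa at positions 2 and 11 receive lex ids 2 and 3, matching the discussion of Figure 1.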

In general the text segment intervals form a set of nested, non-crossing intervals (an interval
tree) and can be represented by a sequence BP of $d$ balanced parentheses, one pair for each text
segment (see Figure 1). Conceptually, if we can identify the text segment interval having the
largest lex id that is a prefix of $P$, referred to as the smallest enclosing text segment interval
of $P$, then we can immediately conclude that $P$ is also prefixed by all intervals which enclose it.

Lemma 8. Given the index pair $(l, r)$ in BP corresponding to the smallest enclosing text
segment interval for a string $P$, the number $occ$ of text segments that are prefixes of $P$ can be
counted in $O(1)$ time and reported in an additional $O(occ)$ time.
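
One way to see Lemma 8 is that the count equals the nesting depth at the left parenthesis $l$, that is, the number of left parentheses minus the number of right parentheses in BP[1..l]. The sketch below uses linear scans in place of constant-time rank, and the parenthesis string is our own encoding of the six intervals of Figure 1 in lex id order (a, aa, aa, ac, aca, cacc); both are assumptions for illustration.

```python
def count_enclosing(bp, l):
    """Number of pairs enclosing position l, including the pair that opens at l."""
    prefix = bp[:l]
    return prefix.count("(") - prefix.count(")")

def report_enclosing(bp, l):
    """Lex ids (ranks of left parentheses) of those pairs, innermost first."""
    stack, opens = [], 0
    for pos in range(1, l + 1):
        if bp[pos - 1] == "(":
            opens += 1
            stack.append(opens)   # the rank of this '(' is the lex id of its interval
        else:
            stack.pop()
    return stack[::-1]

bp = "((())(()))()"   # a( aa( aa() ) ac( aca() ) ) cacc()
l = 7                 # left parenthesis of aca (lex id 5)
print(count_enclosing(bp, l), report_enclosing(bp, l))
```

For a pattern whose smallest enclosing interval is aca, this reports lex ids 5, 4, and 1, i.e. the segments aca, ac, and a.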

3.4 Finding the smallest enclosing text segment interval 

We now describe how the smallest enclosing text segment interval can be determined given any
non-empty SA range $[a, b]$ in SA for $P$. We wish to determine the pair $(l, r)$ of indexes for the
left and right parentheses in BP corresponding to this interval (or an undefined index range
if $P$ is not prefixed by any text segment). Unfortunately, we cannot directly infer where text
segment intervals begin and end based on $T^{BWT}$ alone. Therefore, we make use of a bit vector $B$
of length $n$ and set $B[k] = 1$ if and only if one or more text segment intervals begin at position
$k$, or end at position $k - 1$. For the range $[a, b]$, end cases occur when $B[k] = 0$ for all
$a < k \le n$ (all text segment intervals end before position $a$) or when $B[k] = 0$ for all
$1 \le k \le a$ (all text segment intervals begin after position $a$). Suppose otherwise and let
$c = \arg\max_{1 \le j \le a} \{B[j] = 1\}$ and $d = \arg\min_{a < j \le n} \{B[j] = 1\}$. Note that
position $c$ marks the largest position (up to $a$) where one or more text segment intervals begin
or end (at $c - 1$). Our algorithm considers two main cases: either $B[c]$ marks the beginning of
one or more intervals, or it only marks the end of intervals.
Algorithm 1 Find smallest enclosing text segment interval

Input: $a$ specifies the beginning of the non-empty suffix array interval for string $P$
Output: $l, r$ where $l$ ($r$) is the index of the left (right) parenthesis in BP corresponding to the smallest enclosing
text segment interval of $P$ if it exists, and an undefined interval otherwise

  $c \leftarrow select_1(B, rank_1(B, a))$
  $d \leftarrow select_1(B, rank_1(B, a) + 1)$
  if $c$ or $d$ is undefined then                                  // handle end cases
      return an undefined interval
  $lexid \leftarrow rank_{\phi}(T^{BWT}, d - 1)$
  if $lexid > rank_{\phi}(T^{BWT}, c - 1)$ then                    // B[c] marks beginning of t.s. interval(s)
      if $L[lexid] > |P|$ then
          $lexid \leftarrow rank_{\phi}(T^{BWT}, c - 1) + 1$
          $l \leftarrow select_{(}(BP, lexid)$
          $l, r \leftarrow$ enclose(BP, $l$)
      else
          $l \leftarrow select_{(}(BP, lexid)$
          $r \leftarrow$ findclose(BP, $l$)
  else                                                             // B[c] marks end of t.s. interval(s)
      $r \leftarrow select_{)}(BP, R[rank_1(B, c)])$
      $l \leftarrow$ findopen(BP, $r$)
      $l, r \leftarrow$ enclose(BP, $l$)
  return $l, r$


Lemma 9. Given two positions $c$ and $d$ of $B$, where $c < d$, $B[c] = B[d] = 1$ and $B[k] = 0$ for
$c < k < d$, $B[c]$ marks the beginning of $t$ text segment intervals if and only if
$T^{BWT}[c..d - 1]$ contains $t$ occurrences of the character $\phi$.



Using Lemma 9 we are able to distinguish between the two main cases. If $B[c]$ marks the
beginning of one or more text segment intervals, then $T_j$, the text segment interval with
the largest lex id beginning at position $c$, is the smallest enclosing text segment interval,
provided $|T_j| \le |P|$ (by the condition of Lemma 7). If $|T_j| \le |P|$, we can determine the
largest lex id beginning at position $c$ by simply counting the occurrences of the character $\phi$
prior to position $d$ in $T^{BWT}$. Conveniently and by construction, this corresponds to the rank
of the left parenthesis denoting $T_j$ in BP. It is worth noting that when $|T_j| > |P|$ special
care is required to find the smallest enclosing text segment interval in worst case constant time.
Details are given in the proof of Lemma 10, but the idea is to find the enclosing interval (if any)
of the text segment interval having the smallest lex id beginning at position $c$.

On the other hand, if $B[c]$ only marks the end of one or more text segment intervals, we
can instead identify the right index for $T_{j'}$, the last text segment interval (smallest lex id)
to end at position $c - 1$. The smallest enclosing text segment interval, if any, is therefore the
one enclosing $T_{j'}$. Unfortunately, in this case we cannot infer how many intervals close prior
to position $c$ directly from $T^{BWT}$. For this reason, we employ another compressed integer array
$R$ to record the count of intervals that close prior to position $k$, for all $B[k] = 1$. We
determine the appropriate index for $R$ by simply counting the number of 1's up to position $c$ in
$B$. The corresponding entry in $R$ gives us the rank of the right parenthesis for the last interval
to close prior to position $c$, from which we can find the enclosing interval (if any). The entire
procedure, including end cases, is summarized in Algorithm 1 and shown correct in Lemma 10.



Lemma 10. Let SA be the compressed suffix array for $T$ and let $[a, b]$ be the non-empty suffix
array range in SA for a string $P$. In $O(1)$ time, Algorithm 1 either correctly identifies the
indexes in BP corresponding to the smallest enclosing text segment interval of $P$ if one exists,
or it returns an undefined interval when it does not.
3.5 The overall dictionary and its full-text capabilities

We have shown how all text segments occurring as a prefix of a string $P$ having a non-empty
SA range in SA can be reported efficiently. By enhancing SA with lcp-interval information using
$O(n)$ bits, we can find the matching statistics for $P$ in order to repeat the previous procedure
for every suffix $P[i..|P|]$, $1 \le i \le |P|$ (see Lemma 5). Importantly for our results on
wildcard matching, we note that with a very minor modification, this same construction works when
text segments are separated by more than one $\phi$ character and also when the first text segment
is not preceded by a $\phi$ character. Note that the text segment interval tree can be built in a
similar manner as an lcp-interval tree. Details are left for the full version. We have our first
main result.

Theorem 1. Given a set of $d$ text segments over an alphabet of size $\sigma$ we can construct a
succinct full-text dictionary, based on an enhanced compressed suffix array, using at most
$(1 + o(1))n \log \sigma + O(n) + O(d \log \frac{n}{d})$ bits, where $n$ is the length of $T$, the
text representation of the dictionary including $\phi$ characters, such that the $\gamma$ text
segments contained in a string $P$ can be counted in $O(|P| \log \sigma)$ time and reported in an
additional $O(\gamma)$ time. Furthermore, all text segments prefixed by $P$ can be reported in
$O(|P| \log \sigma + occ)$ time, and all locations in $T$ where $P$ occurs as a substring can be
reported in $O(|P| \log \sigma + occ \log n)$ time.

4 Matching wildcards in succinct texts 

Let $T$ be a string over an alphabet $\Sigma \cup \{\phi\}$ of size $\sigma$ where
$\phi \notin \Sigma$ and $T[i] = \phi$ if and only if position $i$ is a wildcard position in $T$. In
particular, we denote the structure of the input string as
$T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}$, where each text segment $T_i$
contains no wildcards and $\phi^{k_i}$ denotes the $i$th wildcard group of size $k_i \ge 1$, for
$1 \le i \le d$. Our goal is to create an index for the purpose of identifying all the locations in
$T$ that exactly match any query pattern $P$, modulo wildcard positions. Similar to previous
approaches [11, 17], we classify a match into one of three cases according to the substring $X$ of
$T$ to which $P$ aligns: $X$ contains no wildcard group (Type 1), $X$ contains exactly one wildcard
group (Type 2), and $X$ contains more than one wildcard group (Type 3).

4.1 Overall design of the index 

We first build the succinct full-text dictionary of Section 3. By design, the dictionary reports
the match of a text segment $T_j$ based on its lexicographical order (its lex id) relative to other
text segments; however, in the wildcard problem we are required to report the match based on
$T_j$'s position in $T$. Therefore, we store a permutation $\Pi$ mapping the lex ids of text
segments to their relative position order in $T$. For instance, if $T_j$ has lex id $k$, then
$\Pi[k] = j$. We find it convenient to store the following information for each text segment, in
auxiliary arrays, indexed by this relative position order: length, SA range in SA (referenced as
RSA), beginning position in $T$, and the size of the preceding wildcard group. Note that array $L$
of the dictionary construction can be adapted to store lengths in this relative order with the use
of $\Pi$. We also construct a compressed suffix array $\overline{SA}$ for $\overline{T}$, the
reverse of $T$, and store the SA range of each $\overline{T_j}$ with respect to $\overline{SA}$
(referenced as $\overline{RSA}$). Note that $\overline{SA}$ does not need to support location
reporting. We use simple arrays to store SA ranges, resulting in $O(d \log n)$ bits combined space
usage to store auxiliary information supporting constant time access. To support Type 2 matching we
employ a range query data structure occupying $(1 + o(1))k \log k$ bits (see next section).

Lemma 11. Given a text $T$ of length $n$ containing $d$ groups of $k$ wildcards, the combined space
required by the above indexes is $(2 + o(1))n \log \sigma + O(n) + O(d \log n) + O(k \log k)$ bits.

All three matching types make use of the matching statistics of $P$ with respect to SA. Types
2 and 3 matching also make use of the SA ranges of $\overline{P}$ with respect to $\overline{SA}$.
Both can be computed in $O(m \log \sigma)$ time (by Lemmas 1 and 5) and require $O(m \log n)$ bits
to store. We incorporate these times and working space into the results for each type. Type 1
matching is handled by the application of Lemma 1.
4.2 Type 2 matching

A Type 2 match occurs when the alignment of $P$ to $T$ contains exactly (a portion of) one
wildcard group. Specifically, we seek a pair of neighbouring text segments $T_j$ and $T_{j+1}$,
separated by a wildcard group of size $k_j$, where $P[i..|P|]$ aligns to the first $|P| - i + 1$
characters of $T_{j+1}$, referred to as the suffix match (of $P$), and $P[1..i - 1 - k_j]$ aligns to
the last $i - 1 - k_j$ characters of $T_j$, referred to as the prefix match. Let $\alpha_j$
($\omega_j$) be the first (last) $\phi$ character of the $j$th wildcard group in $T$. End cases
occur when the match begins or ends in $T[\alpha'_j..\omega'_j]$, where $\alpha'_j$ ($\omega'_j$)
is the position of $\alpha_j$ ($\omega_j$) in $T$. For now, suppose this is not the case. For a
fixed suffix $P[i..|P|]$ and wildcard group length $k_j$ our strategy will be to (i) find all
potential suffix matches, (ii) record the lex id of the candidate text segments, (iii) find all
potential prefix matches, and (iv) determine which candidate prefix matches are compatible with a
lex id recorded in step (ii).

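[Diagram: $\cdots T_j \, \phi \cdots \phi \, T_{j+1} \cdots$, with $\alpha_j$ marking the first and $\omega_j$ the last $\phi$ of the $j$th wildcard group.]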

Lemma 12. Given a non-empty SA range $[a, b]$ in SA for a string $X$, the lex ids (based on their
lexicographical order) of text segments in $T$ that contain $X$ as a prefix will form a contiguous
(possibly empty) range $[id_1, id_2]$ that can be reported in $O(1)$ time.

By Lemma 12 we can identify the range $[id_1, id_2]$ of lex ids corresponding to text segments
that $P[i..|P|]$ is a prefix of in constant time using its stored SA range with respect to SA,
completing steps (i)-(ii). Determining a range $[id_3, id_4]$ of lex ids corresponding to text
segments that $P[1..i - k_j - 1]$ is a suffix of is equivalent to determining all $\overline{T_t}$
that contain $\overline{P[1..i - k_j - 1]}$ as a prefix. Again, using a stored SA range with respect
to $\overline{SA}$ this can be determined in constant time, completing step (iii). Now consider that
the lex id with respect to SA of a text segment $T_{j+1}$ is relative to the rank of $\omega_j$ in
$T^{BWT}$, the character which precedes it. Similarly, the relative rank of $\alpha_j$ in
$\overline{T}^{BWT}$ determines the lex id of $\overline{T_j}$, but in this case relative to
$\overline{T}$. We make use of a permutation $H$ to relate these lex ids ($\alpha$ and $\omega$
values). Specifically, we set $H[\alpha_j] = \omega_j$, for $1 \le j \le k$. Therefore, we need to
determine the entries in $H[id_3..id_4]$ that have a value in the range $[id_1, id_2]$. This is an
orthogonal range query and by Lemma 6 $H$ can be represented in $(1 + o(1))k \log k$ bits to report
all $occ$ matches in $O(occ \frac{\log k}{\log \log k})$ time. Once a lex id $\omega_j$ has been
verified, a match position can be reported in $O(1)$ time as the location of $T_{j+1}$ with respect
to $T$ is known in addition to the length of the prefix match. This completes step (iv).
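
Step (iv) can be pictured as a two-sided filter over the permutation $H$: restrict the indexes to $[id_3, id_4]$ and keep the entries whose values fall in $[id_1, id_2]$. The sketch below performs this by direct scanning, which takes time linear in the index range rather than the output-sensitive bound of Lemma 6; the permutation shown is hypothetical and the function name is ours.

```python
def range_report(H, id3, id4, id1, id2):
    """Return (index, value) pairs with id3 <= index <= id4 and id1 <= H[index] <= id2 (1-based)."""
    return [(x, H[x - 1])
            for x in range(id3, id4 + 1)
            if id1 <= H[x - 1] <= id2]

# Hypothetical permutation relating prefix-side ranks (indexes) to suffix-side ranks (values).
H = [3, 1, 4, 2, 5]
matches = range_report(H, 2, 4, 2, 4)
print(matches)
```

Each reported pair identifies one wildcard group whose preceding segment is compatible with the prefix match and whose following segment is compatible with the suffix match.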

In general, we can repeat the above procedure for every combination of suffix length and
wildcard group length bounded by $m$. However, as pointed out by Tam et al. [17], the number of
distinct wildcard group sizes $\hat{d}$ is often a small constant, particularly in genomic
sequences. We therefore only consider at most $\hat{d}$ lengths, provided they are not larger than
$m$.

Now, consider the case when $P[i..|P|]$ aligns to a prefix of a wildcard group. To contain
$P[i..|P|]$ as a prefix, the wildcard group must have a length $l \ge |P| - i + 1$. Let $a$ be the
first entry in SA denoting a suffix of $T$ prefixed by at least $l - 1$ $\phi$ characters and let
$b$ be the last entry prefixed by any character. Then, similar to Lemma 12, $T^{BWT}[a..b]$ will
contain a range $[id_1, id_2]$ giving ranks of $\phi$ characters in that interval. Some sub-sequence
of $[id_1, id_2]$ will correspond to $\omega$ wildcards that begin groups having length $l$ or
longer. Therefore, Type 2 matches can be determined by reporting entries in $H[id_3..id_4]$ having a
value in $[id_1, id_2]$, where $[id_3, id_4]$ is defined as before. The case when a prefix of $P$
aligns as a suffix of a wildcard group can be handled similarly. Note that the SA ranges of the at
most $m$ wildcard group lengths we are interested in can be determined in $O(m \log \sigma)$ time
and stored in $O(m \log n)$ bits.

Lemma 13. All Type 2 matches can be reported using $O(m \log n)$ bits of working space in
$O(m(\log \sigma + \min(m, \hat{d}) \frac{\log k}{\log \log k}) + occ_2 \frac{\log k}{\log \log k})$
time.
Algorithm 2 Report Type 3 matches

Input: a string $P$ of length $m$, its matching statistics w.r.t. SA, SA ranges for all suffixes of $\overline{P}$ w.r.t. $\overline{SA}$
Output: positions in $T$ matching $P$, modulo wildcard positions

 1: for $i = 1$ to $m$ do
 2:   let $(q, [a, b])$ be the matching statistics for $P[i..m]$
 3:   use Algorithm 1 to find indexes $(l, r)$ in BP denoting smallest enclosing text segment interval for SA range $[a, b]$
 4:   while $(l, r)$ is a defined interval in BP do
 5:     $lexid \leftarrow rank_{(}(BP, l)$
 6:     $j \leftarrow \Pi[lexid]$
 7:     $[a_p, b_p], [a_s, b_s] \leftarrow$ SA range of $\overline{P[1..i - 1 - k_{j-1}]}$ w.r.t. $\overline{SA}$, SA range of $P[i + l_j + k_j..m]$ w.r.t. SA
 8:     $[c_p, d_p], [c_s, d_s] \leftarrow \overline{RSA}[j - 1], RSA[j + 1]$
 9:     if $i \le l_{j-1} + k_{j-1}$ then                                  // Case 1: P does not contain $T_{j-1}$
10:       if $k_{j-1} \ge i - 1$ or $[a_p, b_p]$ encloses $[c_p, d_p]$ then     // Case 1: prefix condition satisfied
11:         if $m - i + 1 \le l_j + k_j + l_{j+1} - 1$ then                // Case 1a: P does not contain $T_{j+1}$
12:           if $m - i < l_j + k_j$ or $[a_s, b_s]$ encloses $[c_s, d_s]$ then   // Case 1a: suffix condition satisfied
13:             print match at position $x_j - i + 1$
14:         else                                                           // Case 1b: P must contain $T_{j+1}$
15:           set $(i + l_j + k_j)$th bit of $W[j + 1]$ to 1
16:     else                                                               // Case 2: P must contain $T_{j-1}$
17:       if $i$th bit of $W[j]$ is set to 1 then                          // Case 2: prefix condition is satisfied
18:         if $m - i + 1 \le l_j + k_j + l_{j+1} - 1$ then                // Case 2a: P does not contain $T_{j+1}$
19:           if $m - i < l_j + k_j$ or $[a_s, b_s]$ encloses $[c_s, d_s]$ then   // Case 2a: suffix condition satisfied
20:             print match at position $x_j - i + 1$
21:         else                                                           // Case 2b: P must contain $T_{j+1}$
22:           set $(i + l_j + k_j)$th bit of $W[j + 1]$ to 1
23:     $(l, r) \leftarrow$ enclose(BP, $l$)

Notation: $x_j$, $l_j$, $k_j$ denote the position of the text segment $T_j$, its length, and the length of the wildcard group that follows it




4.3 Type 3 matching 

Type 3 matches contain at least (portions of) two wildcard groups and therefore must fully 
contain at least one text segment. The general idea in previous approaches and in this paper is to 
consider this case as an extension of the dictionary matching problem: text segments contained 
within P are candidate positions, but we must verify if they can be extended to a full match of 
P. However, we execute this idea in an altogether novel manner that greatly reduces the working 
space over existing approaches. The complete details of our approach are given in Algorithm 2. 
We now highlight the main idea and give the intuition behind its correctness; a formal proof is 
given in the appendix. 

First, suppose that text segment T_j matches P starting at position i. Consider the conditions 
that must be satisfied to confirm that this match can be extended to a complete match of P in 
T. We must verify that (i) P[1..i - 1] can be matched to the text preceding T_j in T, referred 
to as the prefix condition, and (ii) P[i + |T_j|..|P|] can be matched to the text following T_j in 
T, referred to as the suffix condition. If both conditions are verified, we can report that P 
matches T at position x_j - i + 1, where x_j is the start position of T_j in T. 
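
The two conditions can be illustrated with a brute-force check. This is only a sketch for intuition, not the paper's method: it compares characters directly instead of using suffix array ranges, and the names `WILDCARD`, `compatible` and `extend_candidate` are hypothetical. It assumes the caller has already established that T_j matches P at position i.

```python
WILDCARD = "*"  # assumption: stands in for the paper's wildcard symbol


def compatible(text_piece, pattern_piece):
    """True if the two pieces have equal length and agree outside wildcards."""
    return len(text_piece) == len(pattern_piece) and all(
        t == WILDCARD or t == p for t, p in zip(text_piece, pattern_piece)
    )


def extend_candidate(text, pattern, x_j, seg_len, i):
    """Try to extend a match of T_j at pattern position i to a full match.

    x_j is the 1-based start of T_j in the text, seg_len is |T_j|, and i is
    the 1-based position in the pattern where T_j matches. Returns the
    1-based match position x_j - i + 1, or None if a condition fails.
    """
    start = x_j - i + 1                # 1-based start of the candidate match
    end = start + len(pattern) - 1     # 1-based end of the candidate match
    if start < 1 or end > len(text):
        return None
    # Prefix condition: P[1..i-1] against the text preceding T_j.
    prefix_ok = compatible(text[start - 1:x_j - 1], pattern[:i - 1])
    # Suffix condition: P[i+|T_j|..m] against the text following T_j.
    suffix_ok = compatible(text[x_j - 1 + seg_len:end], pattern[i - 1 + seg_len:])
    return start if prefix_ok and suffix_ok else None
```

For example, with the text `acg*ta**gc` the segment `ta` starts at x_j = 5; if it matches the pattern `gctaacg` at i = 3, both conditions hold and the candidate extends to a match at position 3.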




For working space, we make use of an array W containing d + 1 entries (one for each text 
segment) of m bits, with all entries set to zero using the constant time initialization technique [3]. 
During the course of the algorithm the i-th bit of W[j] is set to 1 if the prefix condition is true 
for P[1..i - 1] with respect to T_j. There are exactly m stages of the algorithm (i = 1, ..., m) 
corresponding to the suffixes of P. In a given stage i we consider each text segment T_j found 
to be a prefix of the i-th suffix of P. To verify the prefix and suffix conditions for T_j we first 
consider (line 9 of Algorithm 2): will P[1..i - 1] need to fully contain the previous text segment 
T_{j-1} in order to match in T? This breaks our algorithm into the two main cases. If not (Case 1), 
we check the prefix condition by checking whether P[1..i - 1] is compatible with the wildcard 
group to its left and the suffix of T_{j-1} to which it must align (line 10). If the prefix condition 
is satisfied, we consider (line 11): will P[i + |T_j|..m] need to fully contain the next text segment 
T_{j+1} in order to match in T? If not (Case 1a), we check whether the suffix condition is satisfied 
by checking that P[i + |T_j|..m] is compatible with the wildcard group to its right and the prefix 
of T_{j+1} to which it must align (line 12). If indeed the suffix condition is satisfied, we output a 
match (line 13). If yes (Case 1b), we set the (i + l_j + k_j)-th bit of entry W[j + 1] to 1, to indicate 
that a prefix condition holds for P[1..i + l_j + k_j - 1] with respect to T_{j+1} (line 15). The key 
idea here is that we only attempt to verify the suffix condition when T_j would be the last text 
segment to occur in P (i.e., Case 1a) and if not (Case 1b), we record information in W stating 
that we currently have a partial match, but for it to remain viable, T_{j+1} should be a prefix of 
P[i + l_j + k_j..m]. Case 2 occurs when P must contain the previous text segment T_{j-1} to satisfy 
the prefix condition (lines 16-22). Since stages of the algorithm proceed with increasing values 
of i, the prefix condition would have been previously checked and, if satisfied, the i-th bit 
of W[j] would be set to 1. The remaining questions are answered as before: the suffix condition 
is verified if possible, and otherwise successful partial matches are again recorded in W. 
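
The case analysis above can be traced in a small executable sketch. It is a simplification under stated assumptions rather than the paper's implementation: candidate segments are found with a naive `startswith` scan instead of dictionary matching on the suffix array, the prefix and suffix conditions are tested by direct string comparison instead of SA range enclosure, W is modeled with Python integers used as bit vectors, and the first and last text segments are handled by explicit boundary checks that are my own addition. The function name `type3_matches` and its input layout (non-empty segments T_1..T_{d+1} and gap lengths k_1..k_d, with T starting and ending in a text segment) are hypothetical. As written, the sketch reports every occurrence of P that fully contains at least one text segment.

```python
def type3_matches(segments, gaps, pattern):
    """Report 1-based match positions of pattern that fully contain a segment."""
    m, d1 = len(pattern), len(segments)
    l = [len(s) for s in segments]
    k = list(gaps) + [0]               # no wildcard group after the last segment
    x, pos = [], 1                     # x[j]: 1-based start of segment j in T
    for j in range(d1):
        x.append(pos)
        pos += l[j] + k[j]
    W = [0] * d1                       # W[j]: bit i set = prefix condition holds
    out = []
    for i in range(1, m + 1):          # stage i handles the suffix P[i..m]
        suffix, rem = pattern[i - 1:], m - i + 1
        for j in range(d1):
            if not suffix.startswith(segments[j]):   # naive candidate search
                continue
            if j == 0:                               # nothing precedes T_1
                prefix_ok = (i == 1)
            elif i <= l[j - 1] + k[j - 1]:           # Case 1: check directly
                need = pattern[:max(0, i - 1 - k[j - 1])]
                prefix_ok = segments[j - 1].endswith(need)
            else:                                    # Case 2: look up W
                prefix_ok = bool((W[j] >> i) & 1)
            if not prefix_ok:
                continue
            if j == d1 - 1:                          # nothing follows the last segment
                if rem == l[j]:
                    out.append(x[j] - i + 1)
            elif rem < l[j] + k[j] + l[j + 1]:       # Case a: verify the suffix now
                tail = pattern[i - 1 + l[j] + k[j]:]
                if segments[j + 1].startswith(tail):
                    out.append(x[j] - i + 1)
            else:                                    # Case b: defer to T_{j+1}
                W[j + 1] |= 1 << (i + l[j] + k[j])
    return sorted(out)
```

For the text `acg*ta**gc` (segments `acg`, `ta`, `gc` with gaps 1 and 2), the pattern `cgttaaagc` exercises Case 1b at stage i = 4 and Case 2 at stage i = 8, and the match is reported at position 2.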

Lemma 14. All Type 3 matches can be reported in O(m log σ + γ) time using O(dm + m log n) 
bits of working space. 

Combining the results for the 3 types of matching we arrive at our second main result. 

Theorem 2. Given a text T of length n containing d groups of k wildcards, all matches of a 
pattern P of length m can be reported using O(dm + m log n) bits of working space in 
O(m(log σ + min(m, d) log k / log log k) + occ_1 log n + occ_2 log k / log log k + γ) time with an 
index occupying (2 + o(1)) n log σ + O(n) + O(d log n) + O(k log k) bits of space. 

Acknowledgments. The author would like to thank Anne Condon for helpful discussions, 
detailed feedback and suggestions on this manuscript. 

A Supporting Proofs 

Proof of Lemma 7 

Proof. We first consider the case for determining if T_j is a prefix of P. Suppose that T_j is a 
prefix of P. Then it must be the case that |T_j| ≤ |P|. By definition T[SA[c]..|T|] (T[SA[d]..|T|]) 
is lexicographically smaller (greater) than any other suffix of T prefixed by the string T_j; thus, 
[c, d] must enclose [a, b] and we have c ≤ a ≤ b ≤ d. 

Next consider the case when c ≤ a ≤ b ≤ d and |T_j| ≤ |P|. Since [c, d] encloses [a, b], P and T_j 
must share a common prefix of length min(|P|, |T_j|). If [a, b] = [c, d] it could be the case that P 
is a proper prefix of T_j; however, since |P| ≥ |T_j|, P and T_j must share a common prefix 
of length at least |T_j|. Thus, T_j is a prefix of P. 

The other case is symmetric, but it is not necessary to compare the lengths of P and T_j. □ 
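
The enclosure test, including the length comparison needed when the two ranges coincide, can be tried on a toy text. This is a minimal sketch with a naively sorted suffix array; the helper names are hypothetical, the text contains no wildcards, and it assumes both strings occur in the text (the paper works with matching statistics instead).

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (0-based)."""
    return sorted(range(len(text)), key=lambda s: text[s:])


def sa_range(text, sa, pattern):
    """Inclusive rank range of suffixes prefixed by pattern, or None."""
    hits = [r for r, s in enumerate(sa) if text.startswith(pattern, s)]
    return (hits[0], hits[-1]) if hits else None


def is_prefix_by_ranges(text, sa, segment, pattern):
    """Decide whether segment is a prefix of pattern using SA ranges only."""
    cd = sa_range(text, sa, segment)
    ab = sa_range(text, sa, pattern)
    if cd is None or ab is None:
        return False
    (c, d), (a, b) = cd, ab
    # [c, d] must enclose [a, b]; the length test rules out the case where
    # the ranges are equal but the pattern is a proper prefix of the segment.
    return c <= a <= b <= d and len(segment) <= len(pattern)
```

On the text `banana$`, the strings `an` and `ana` have the same SA range, so only the length comparison distinguishes "`an` is a prefix of `ana`" from the reverse.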

Proof of Lemma 12 

Proof. This follows from the proof of Lemma 7 and from the definition of lex ids, since they 
correspond to φ characters (which prefix text segment occurrences only) in T^BWT that must 
necessarily be contained within the SA range for those text segment occurrences. □ 

Proof of Lemma 8 

Proof. We let I_1 denote the interval in BP specified by (l, r). If I_1 is an undefined interval then 
P is not prefixed by any text segment (occ = 0) and we are done. Suppose I_1 is defined. This 
interval is enclosed by another interval I_2 = (p, q) if and only if p < l and q > r. Since text 
segment intervals cannot cross, if I_2 opens before I_1 (p < l) it is either the case that I_2 closes 
before I_1 opens (q < l) or I_2 closes after I_1 closes (q > r); it is the latter case we are interested in. 
We count the number of intervals that begin (opening parentheses), up to index l, and subtract 
the number which also end (closing parentheses), up to index l. The difference is exactly the 
number of enclosing intervals for I_1. Specifically, occ = rank_((BP, l) - rank_)(BP, l) and can be 
computed in O(1) time. 

Reporting the text segment match for interval I_1 consists of outputting a tuple containing 
(start, end, lexid). The lexid is the lexicographical order of the text segment (relative to others) 
and is determined in O(1) time as lexid = rank_((BP, l). Since we report text segments that are 
prefixes of P, then start = 1 and end = start + L[lexid] - 1 (as lengths of text segments are 
stored in L according to their lex id). After reporting the match for I_1, we can determine the 
next enclosing interval by setting (l, r) = enclose(BP, l) and repeating the above procedure 
until all occ occurrences have been reported. □ 
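
The counting argument can be checked on a small parenthesis string. This is a sketch only: `rank`, `find_close` and `enclose` are implemented here by linear scans, whereas the succinct structures assumed in the proof answer them in O(1) time, and the function names are hypothetical. Indexes are 1-based to match the proof.

```python
def rank(bp, ch, idx):
    """Number of occurrences of ch in bp[1..idx] (1-based, inclusive)."""
    return bp[:idx].count(ch)


def find_close(bp, l):
    """1-based index of the parenthesis closing the one opened at l."""
    depth = 0
    for p in range(l, len(bp) + 1):
        depth += 1 if bp[p - 1] == "(" else -1
        if depth == 0:
            return p
    return None


def enclose(bp, l):
    """Tightest pair (l', r') strictly enclosing the pair opened at l, or None."""
    depth = 0
    for p in range(l - 1, 0, -1):
        if bp[p - 1] == ")":
            depth += 1          # skip over a sibling pair
        elif depth == 0:
            return (p, find_close(bp, p))
        else:
            depth -= 1
    return None


def report_enclosing(bp, l):
    """List (l, r, lexid) for the pair at l and every pair enclosing it."""
    occ = rank(bp, "(", l) - rank(bp, ")", l)
    out, interval = [], (l, find_close(bp, l))
    while interval is not None:
        out.append((interval[0], interval[1], rank(bp, "(", interval[0])))
        interval = enclose(bp, interval[0])
    assert len(out) == occ      # the rank difference counts the reported pairs
    return out
```

For `bp = "(()(()))"` and l = 5, the rank difference is 4 - 1 = 3, and the loop reports the pairs (5, 6), (4, 7) and (1, 8) with lex ids 4, 3 and 1.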

Proof of Lemma 9 

Proof. Suppose t text segment intervals begin at position c. As previously stated, if two or 
more text segment intervals begin at the same position then they are different occurrences of 
the same text segment string ω. By definition of B, no other text segment interval can begin 
before position d in SA. If B[d] marks the beginning of another text segment interval, it must 
be lexicographically larger than ω and therefore all t occurrences of ω appear before position 
d. If B[d] instead marks the end of one or more text segment intervals (at position d - 1), it 
must be for the t occurrences of ω since text segment intervals cannot cross. In either case, all 
occurrences of the text segment ω must appear in SA in the range [c..d - 1] (possibly in addition 
to other suffixes of T prefixed by the string ω). Since only text segment instances are prefixed 
by the character φ in T, then T^BWT[c..d - 1] must contain exactly t occurrences of φ. 

Suppose T^BWT[c..d - 1] contains t occurrences of the character φ. Since each text segment 
occurrence is prefixed by the character φ in T, then t suffixes of T in the range [c..d - 1] of SA 
are prefixed by text segment occurrences. Each text segment occurrence corresponds to one text 
segment interval. Text segment intervals only begin in positions k where B[k] = 1. Therefore t 
text segment intervals begin at position c as no other text segment intervals can begin before 
position d, by definition of B. □ 

Proof of Lemma 10 (Algorithm 1 - Find smallest enclosing text segment interval) 

Proof. Algorithm 1 begins by identifying the last entry in B up to position a and the first entry 
after position a equal to 1, denoting the opening or closing of text segment intervals; call these 
positions c and d. If either of these is undefined, then a text segment interval cannot enclose 
[a, b] and an empty interval is returned (lines 3-4). 

If T^BWT[c..d - 1] contains one or more φ characters then by Lemma 9, B[c] marks the beginning 
of some number of text segment intervals (lines 6-13). Since text segment interval lex ids are 
based on their lexicographical order in SA, the lex id of the last text segment interval to 
open at position c is lexid, given by the count of φ characters up to position d - 1 in T^BWT. 
Let T_k be this text segment interval. By Lemma 7 we must also ensure that |P| ≥ |T_k| by 
checking the text segment length in L (line 7). If P is shorter than T_k (lines 8-10), then it is also 
shorter than all text segment intervals beginning at position c since they represent the same 
text segment string. However, it is possible that there exists a text segment interval T_j that is a 
longest proper prefix of T_k. Note that |P| > |T_j|, since P must be lexicographically larger than 
T_j; otherwise B[c] would correspond to this interval instead of T_k. If T_j exists, it would enclose 
the first text segment interval that begins at position c. We can find the lex id for the first text 
segment interval opening at position c (smallest lex id) in a similar way, except that we count 
the occurrences of φ prior to position c and then add one. The lex id will correspond to the 
rank of the left parenthesis in BP and the index l is easily determined by a select operation. 
Note that the enclose operation will return an undefined interval if T_j does not exist. If instead 
|T_k| ≤ |P| (lines 12-13), we can simply determine the index for the left parenthesis denoting 
the text segment interval T_k. 

Otherwise, B[c] only marks the end of some text segment interval(s) (lines 15-17). In this 
case, we use the number of occurrences of 1s in B up to position c as an index into the array 
R which stores the number of text segment intervals that close prior to the position denoted by 
that entry. This allows us to identify T_k, the last text segment interval to close prior to position 
a (the one having the smallest lex id). If another text segment interval T_j encloses T_k, then it 
must be the case that T_j encloses [a, b] and |T_j| < |P|, otherwise T_j would also close prior to 
position a. 

At this point, the pair (l, r) either correctly identifies the smallest enclosing text segment 
interval for the SA range [a, b], or it is an undefined interval if none exists. Overall, a constant 
number of operations are required and all can be computed in O(1) time. □ 

Proof of Lemma 11 (Space analysis of our succinct wildcard index) 

Proof. The succinct full-text dictionary requires (1 + o(1)) n log σ + O(n) + O(d log(n/d)) bits by 
Theorem 1, which in turn is based on a combination of Lemmas 1-4, and the additional O(n) bits 
required to enhance SA with lcp-interval information and to store the LCP array. The wildcard 
index also requires a suffix array of the reverse of string T which occupies (1 + o(1)) n log σ 
bits by Lemma 1. The most space dominant auxiliary array is used to store suffix array ranges 
in O(d log n) bits. The range query data structure requires O(k log k) bits, as stated earlier. Thus, 
overall we have a space complexity of (2 + o(1)) n log σ + O(n) + O(d log n) + O(k log k) bits. □ 

Proof of Lemma 14 (Algorithm 2 - Type 3 Matching) 

Proof. Recall that the algorithm proceeds in m stages for increasing i = 1, ..., m, one for each 
suffix of P. It is clear from the algorithm description that verification of a match of T_j proceeds 
by first ensuring the prefix condition can be satisfied (Case 1: P does not contain T_{j-1}) or 
ensuring it was previously satisfied (Case 2: P must contain T_{j-1}), and then verifying the suffix 
condition in the cases where P does not contain T_{j+1} (Cases 1a, 2a) (and reporting a match when 
verified), or by instead marking W to signify a partial match, expecting the match to be continued 
by a match of T_{j+1} at the time step i + l_j + k_j (Cases 1b, 2b). The correctness relies on showing 
that W is set correctly to confirm the satisfaction of the prefix condition for the next text segment 
(T_{j+1}) for a future time step. We show correctness by induction on i. Consider the base case 
(i = 1). All candidate text segments T_j fall into Case 1 which (importantly) does not rely on 
the correctness of previous steps of the algorithm. The prefix condition is trivially true. Thus, 
if a successful match of P[1..m] to T[x_j..n] will not fully contain T_{j+1} we can simply check if 
P[l_j + k_j + 1..m] is a prefix of T_{j+1} by Lemma 7. If it is, both conditions have been satisfied 
and we have a match; otherwise, we record in W[j + 1] that T_{j+1} must appear as a prefix of 
P[l_j + k_j + 1..m] to form a successful match. Now assume we are in step i and the algorithm 
is correct up to step i - 1. Case 1 is handled as before and does not rely on the correctness of 
previous steps, so assume we are in Case 2 (P must contain T_{j-1}). Then, if the prefix condition 
is satisfied the i-th bit of W[j] should be set to 1. Since this would have been set at some step 
t < i, and we have assumed the algorithm is correct up to step i - 1, it must be the case 
that the prefix condition for T_j is satisfied if and only if W[j] has bit i set to 1. Similarly to 
before, if the prefix condition is satisfied, we can attempt to verify the suffix condition using 
Lemma 7 when P does not contain T_{j+1}, or record the partial match in W as before. This 
completes the correctness proof. 

We now consider the additional runtime and work space incurred for Type 3 matching. There 
are γ candidate positions overall that can be reported in O(m log σ + γ) time by Theorem 1. Each 
candidate is processed once, in O(1) time. The array W occupies O(dm) bits as working space. 
Thus, the overall time complexity is O(m log σ + γ) and working space is O(dm + m log n). □ 
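
The time bound depends on W being usable without an explicit pass that zeroes all of its entries, which is the role of the constant time initialization technique cited in Section 4.3. The sketch below shows the well-known textbook variant of that idea, with a back-pointer array and a stack of written indexes. It is an illustration only: the class name `LazyArray` is hypothetical, the scheme in [3] may differ and be more space-efficient, and Python cannot allocate truly uninitialized memory, so random garbage is used to simulate it.

```python
import random


class LazyArray:
    """Array whose cells read as `default` until first written.

    No cell is cleared up front: a cell i counts as written only if
    where[i] points into the used part of `stack` and that stack slot
    points back at i. Garbage values can never pass both checks.
    """

    def __init__(self, size, default=0):
        # Random contents simulate uninitialized memory. This fill takes
        # O(size) time in Python and exists only for the simulation.
        self.value = [random.randrange(1 << 30) for _ in range(size)]
        self.where = [random.randrange(size) for _ in range(size)]
        self.stack = [random.randrange(size) for _ in range(size)]
        self.top = 0              # number of cells written so far
        self.default = default

    def _is_set(self, i):
        s = self.where[i]
        return s < self.top and self.stack[s] == i

    def get(self, i):
        return self.value[i] if self._is_set(i) else self.default

    def set(self, i, v):
        if not self._is_set(i):   # first write: register the cell
            self.where[i] = self.top
            self.stack[self.top] = i
            self.top += 1
        self.value[i] = v
```

In the setting of Algorithm 2, each entry W[j] would be one such cell holding an m-bit vector, so only the entries actually touched while processing candidates incur any cost.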



